Abstract 

EquiX is a search language for XML that combines the power of querying with the 
simplicity of searching. Requirements for such languages are discussed and it is shown 
that EquiX meets the necessary criteria. Both a graph-based abstract syntax and a formal 
concrete syntax are presented for EquiX queries. In addition, the semantics is defined and an 
evaluation algorithm is presented. The evaluation algorithm is polynomial under combined 
complexity. 

EquiX combines pattern matching, quantification and logical expressions to query both 
the data and meta-data of XML documents. The result of a query in EquiX is a set of XML 
documents. A DTD describing the result documents is derived automatically from the query. 

1 Introduction 

The widespread use of the World-Wide Web has given rise to a plethora of simple query processors, 
commonly called search engines. Search engines query a database of semi-structured data, 
namely HTML pages. Currently, search engines cannot be used to query the meta-data content 
in such pages. Only the data can be queried. For example, one can use a search engine to find 
pages containing the word "villain". However, it is difficult to obtain only pages in which villain 
appears in the context of a character in a Wild West movie. More and more XML pages are 



*This work was supported in part by grant 9481-3-00 of the Israeli Ministry of Science. 
^Institute for Computer Science, The Hebrew University, Jerusalem 91904, Israel. 

''Department of Computing and Electrical Engineering, Heriot-Watt University, Edinburgh, EH14 4AS. 
^Department of Computer Science, K. U. Leuven, Celestijnenlaan 200A, B-3001, Heverlee, Belgium. 






Cohen et al. 






finding their way onto the Web. Thus, it is becoming increasingly important to be able to query 
both the data and the meta-data content of the pages on the Web. We propose a language for 
querying (or searching) the Web that fills this void. 

Search engines can be viewed as simple query processors. The query language of most search 
engines is rather restricted. Both traditional database query languages, such as SQL, and newly 
proposed languages, such as XQL (Robie et al., 1998), XML-QL (Deutsch et al., 1998) and 
Xmas (Baru et al., 1995; Ludascher et al., 1999), are much richer than the query language 
of most search engines. However, the limited expressiveness of search engines appears to be 
an advantage in the context of the Web. Many Internet users are not familiar with database 
concepts and find it hard to formulate SQL queries. In comparison, when it comes to using 
search engines, experience has proven that even novice Internet users can easily ask queries 
using a search engine. It is likely that this is true because of the inherent simplicity of the 
search-engine query languages. 

Consequently, an apparent disadvantage of search-engine languages is really an advantage 
when it comes to querying the Web. Thus, it is imperative to first understand the requirements 
of a query language for the Web, before attempting to design such a language. We believe that 
the Web gives rise to a new concept in query languages, namely search languages. We will 
present design criteria for search languages. 

As its name implies, a search language is a language that can be used to search for data. We 
differentiate between the terms search and query. Roughly speaking, a search is an imprecise 
process in which the user guesses the content of the document that she requires. Querying is a 
precise process in which the user specifies exactly the information she is seeking. In this paper 
we define a language that has both searching and querying capabilities. We call a language that 
allows both searching and querying a search language. 

We call a query written in a search language a search query and the query result a search 
result. Similarly, we call a query processor for a search language a search processor. From 
analyzing popular search engines, one can define a set of criteria that should guide the design 
of a search language and processor. We present such criteria below. 

1. Format of Results: A search result of a search query should be either a set of documents 
(pages) or sections of documents that satisfy the query. In general, when searching, the 
user is simply interested in finding information. Thus, a search query need not perform 
restructuring of documents to compute results. This simplifies the formulation of a search 
query since the format of the result need not be specified. 

2. Pattern Matching: A search language should allow some level of pattern matching both 
on the data and meta-data. Clearly, pattern matching on the data is a convenient way 
of specifying search requirements. Pattern matching on the meta-data allows a user to 
formulate a search query without knowing the exact structure of the document. In the 
context of searching, it is unlikely that the user will be aware of the exact structure of the 
document that she is seeking. 

3. Quantification: Many search languages currently implemented on the Web allow the 
user to specify quantifications in search queries. For example, the search query "+Wild 
-West", according to the semantics of many of the search engines found on the Web, 
requests documents in which the word "Wild" appears (i.e., exists) and the word "West" 
does not appear (i.e., not exists). The ability to specify quantifications should be extended 
to allow quantifications in querying the meta-data. 

4. Logical Expressions: Many search engines allow the user to specify logical expressions 
in their search languages, such as conjunctions and disjunctions of conditions. This should 
be extended to enable the user to use logical expressions in querying the meta-data. 

5. Iterative Searching Ability: The result of a search query is generally very large. Many 
times a result may contain hundreds, if not thousands, of documents. Users generally do 
not wish to sift through many documents in order to find the information that they require. 
Thus, it is a useful feature for a search processor to allow requerying of previous results. 
This enables users to search for the desired information iteratively, until such information 
is found. 

6. Polynomial Time: The database over which search queries are computed is large and 
is constantly growing. Hence, it is desirable for a search query to be computable in 
polynomial time under combined complexity (i.e., when both the query and the database 
are part of the input). 

When designing a search language, there is an additional requirement that is more difficult 
to define scientifically. A search language should be easy to use. We present our final criterion. 

7. Simplicity: A search language should be simple to use. One should be able to formulate 
queries easily and the queries, once formulated, should be intuitively understandable. 

The definition of requirements for a search language is interesting in itself. In this paper 
we present a specific language, namely EquiX, that fulfills requirements 1 through 6. From 
our experience, we have found EquiX search queries to be intuitively understandable. Thus, we 
believe that EquiX satisfies the additional language requirement of simplicity. EquiX is rather 
unique in that it combines both polynomial query evaluation (under combined complexity) 
with several powerful querying abilities. In EquiX, both quantification and negation can be 
used. Regular expressions can also be used on the data of an XML document. In an extension 
to EquiX we allow aggregation on the data and a limited class of regular expressions on the 
metadata. Both searching and querying can be performed using the EquiX language. EquiX 
also simplifies the querying process by automatically generating both the format of the result 
and a corresponding DTD. 

This paper extends previous work (Cohen et al., 1999; Cohen et al., 2000). In Section 2 we 
present a data model for XML documents. Both the concrete and abstract syntax for EquiX 
queries are described in Section 3. In Section 4 we define the semantics of EquiX, and in Section 5 
a polynomial algorithm for evaluating EquiX queries is presented. A procedure for computing a 
result DTD is presented in Section 6. In Section 7 we present some extensions to our language 
and in Section 8 we conclude. We present proofs of theorems in the Appendix. 

2 Data Model 

We define a data model for querying XML documents (Bray et al., 1998). At first, we assume 
that each XML document has a given DTD. We will relax this assumption later in the paper. The 
term element will be used to refer to a particular occurrence of an element in a document. The 
term element name will refer to the name of an element and thus, may appear many times in 
a document. Similarly we use attribute to refer to a particular occurrence of an attribute and 
attribute name to refer to its name. At times, we will blur the distinction between these terms 
when the meaning is clear from the context. 

We introduce some necessary notation. A directed tree over a set of nodes $N$ is a pair 
$T = (N, E)$ where $E \subseteq N \times N$ and $E$ defines a tree structure. We say that the edge $(n, n')$ is 
incident from $n$ and incident to $n'$. Note that in a tree, there is at most one edge incident to any 
given node. We assume throughout this paper that all trees are finite. The root of a directed 
tree is the node $r \in N$ such that every node in $N$ is reachable from $r$ in $T$. We denote a rooted 
directed tree as a triple $T = (N, E, r)$. 
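
As a small illustration of these definitions, the following Python sketch checks whether a set of nodes and edges, together with a designated node, forms a rooted directed tree. The function name `is_rooted_tree` and the representation of edges as pairs are assumptions made for this example only.

```python
from typing import Hashable, Iterable, Set, Tuple


def is_rooted_tree(nodes: Set[Hashable],
                   edges: Iterable[Tuple[Hashable, Hashable]],
                   root: Hashable) -> bool:
    """Check that (nodes, edges) is a directed tree rooted at `root`."""
    incoming = {n: 0 for n in nodes}
    children = {n: [] for n in nodes}
    for parent, child in edges:
        incoming[child] += 1          # the edge is incident to `child`
        children[parent].append(child)
    # The root has no incoming edge; every other node has at most one.
    if incoming[root] != 0 or any(count > 1 for count in incoming.values()):
        return False
    # Every node must be reachable from the root.
    seen, stack = {root}, [root]
    while stack:
        for child in children[stack.pop()]:
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen == set(nodes)


print(is_rooted_tree({0, 1, 2}, [(0, 1), (0, 2)], 0))
print(is_rooted_tree({0, 1, 2}, [(0, 1), (2, 1)], 0))
```

The first call describes a tree; in the second, node 1 has two incoming edges, so the check fails.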

An XML document contains both data (i.e., atomic values) and meta-data (i.e., elements 
and attributes). The relationships between data and meta-data, (and between meta-data and 
meta-data) are reflected in a document by use of nesting. 

We will represent an XML document by a directed tree with a labeling function. The data 
and meta-data in a document correspond to nodes in the tree with appropriate labels. Nodes 
corresponding to meta-data are complex nodes while nodes corresponding to data are atomic 
nodes. The relationships in a document are represented by edges in the tree. In this fashion, an 
XML document is represented by its parse tree. 
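
To make this representation concrete, the following Python sketch builds a labeled tree from an XML string using the standard library module `xml.etree.ElementTree`. The `Node` class, the `build_tree` helper and the choice to model each attribute as a complex node with a single atomic child are assumptions made for this illustration; they are not part of EquiX itself.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    label: str                      # element/attribute name, or the atomic value
    atomic: bool = False            # True for data nodes, False for meta-data nodes
    children: List["Node"] = field(default_factory=list)


def build_tree(elem: ET.Element) -> Node:
    """Convert a parsed element into a labeled tree (hypothetical helper)."""
    node = Node(label=elem.tag)
    # Assumption: each attribute becomes a complex node with one atomic child.
    for name, value in elem.attrib.items():
        node.children.append(Node(name, children=[Node(value, atomic=True)]))
    # Text directly inside the element becomes an atomic node.
    if elem.text and elem.text.strip():
        node.children.append(Node(elem.text.strip(), atomic=True))
    for child in elem:
        node.children.append(build_tree(child))
    return node


doc = """<movieInfo>
  <movie>
    <descr>takes place in the Wild West...</descr>
    <title>The Lone Cowboy</title>
    <character role="villain" star="724"/>
  </movie>
  <actor id="724"><name>Sam Mellow</name></actor>
</movieInfo>"""

root = build_tree(ET.fromstring(doc))
print(root.label, [c.label for c in root.children])
```

Complex nodes carry element and attribute names, while atomic nodes carry the data, mirroring the distinction drawn above.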

Note that using ID and IDREF attributes one can represent additional relationships between 
values. When considering these relationships, a document may no longer be represented by a 
tree. In the sequel we will utilize ID and IDREF attributes to answer search queries. 

In general, a parsed XML document need not be a rooted tree. An XML document that 
gives rise to a rooted tree is said to be rooted. The element that corresponds to the root of the 
tree is called the root element. Given an XML document that is not rooted, one can create a 
rooted document by adding a new element to the document and placing its opening tag at the 
beginning of the document, and its closing tag at the end of the document. This new element 
will be the root element of the new document. With little effort we can adjust the DTD of 
the original document to create a new DTD that the new document will conform to. Thus, we 
assume without loss of generality that all XML documents in a database are rooted. 

We now give a formal definition of an XML document. We assume that there is an infinite 
set $\mathcal{A}$ of atoms and an infinite set $\mathcal{L}$ of labels. 

Definition 2.1 (XML Document) An XML document is a pair $X = (T, l)$ such that 

• $T = (N, E, r)$ is a rooted directed tree;^1 

• $l : N \to \mathcal{L} \cup \mathcal{A}$ is a labeling function that associates each complex node with a value in $\mathcal{L}$ 
and each atomic node with a value in $\mathcal{A}$. 

We assume that each DTD has a designated element name, called the root element name of 
the DTD. Consider a DTD $d$ with a root element name $e$. We say that a document $X = (T, l)$ 
with root $r$ strictly conforms to $d$ if 

1. the document $X$ conforms to $d$ (in the usual way (Bray et al., 1998)) and 

2. the function $l$ assigns the label $e$ to the root $r$ (i.e., $l(r) = e$). 

The DTD in Figure 1 with root element name movieInfo describes information about movies. 
In Figure 2 an XML document containing movie information is depicted. This document strictly 
conforms to this DTD. Note that the nodes in Figure 2 are numbered. The 
numbering is for convenient reference and is not part of the data model. 

A catalog is a pair $C = (d, S)$ where $d$ is a DTD and $S$ is a set of XML documents, each of 
which strictly conforms to $d$. A database is a set of catalogs. Note the similarity of this definition 
to the relational model where a database is a set of tuples conforming to given relation schemes. 

^1 Note that an XML document is a sequence of characters. Thus, to properly model the ordering of elements 
in a document, an ordering function on the children of a node should be introduced. For simplicity of exposition 
we chose to omit this in the paper. 






<!ELEMENT movieInfo (movie+, actor+)>
<!ELEMENT movie     (descr, title, character+)>
<!ELEMENT actor     (name)>
<!ATTLIST actor
    id   ID     #REQUIRED>
<!ELEMENT descr     (#PCDATA)>
<!ELEMENT title     (#PCDATA)>
<!ELEMENT name      (#PCDATA)>
<!ELEMENT character EMPTY>
<!ATTLIST character
    role CDATA  #REQUIRED
    star IDREF  #REQUIRED>

Figure 1: DTD describing movie information 
[Tree diagram of an XML document rooted at movieInfo, with movie, actor, descr, title, character, name, role, star and id nodes, each labeled with a reference number.] 

Figure 2: An XML Document 



This data model is natural and has useful characteristics. Our assumption that each XML 
document conforms to a given DTD implies that the documents are of a partially known structure. 
We can display this knowledge for the benefit of the user. Thus, the task of finding 
information in a database does not require a preliminary step of querying the database to discover 
its structure. 

3 Search Query Syntax 

In this section we present both a concrete and an abstract syntax for EquiX search queries. A 
search query written in the concrete syntax is a concrete query and a search query written in 
the abstract syntax is an abstract query. 

3.1 Concrete Query Syntax 

The concrete syntax is described informally as part of the graphical user interface currently 
implemented for EquiX. Intuitively, a query is an "example" of the documents that should 
appear in the output. By formulating an EquiX query the user can specify documents that she 
would like to find. She can specify constraints on the data that should appear in the documents. 
We call such constraints content constraints. She can also specify constraints on the meta-data, 
or structure, of the documents. We call such constraints structural constraints. In addition, the 
user can specify quantification constraints which constrain the data and meta-data that should 
appear in the resulting documents by determining how the content and structural constraints 
should be applied to a document. 

The user formulates her query interactively. The user chooses a catalog $(d, S)$. Only documents 
in $S$ will be searched (queried). At first a minimal query is displayed. In a minimal query, 
only the root element name of $d$ is displayed. A minimal query looks similar to an empty form 
for querying using a search engine (see Figure 3). The user can then add content constraints 
by filling in the form, or add structural constraints by expanding elements that are displayed. 
When an element is expanded, its attributes and subelements, as defined in $d$, are displayed. 
The user can add content constraints to the elements and attributes. The user can also specify 
the quantification that should be applied to each element and attribute, i.e., quantification constraints. 
This can be one of exists, not exists, for all, and not for all (written in a user-friendly 
fashion). In addition, the user can choose which elements in the query should appear in the 
output. 

[Screenshot of the EquiX Query Form: a minimal query reading "Find each XML document that: Has a movieInfo", with a search-condition dialog in which "Wild West" is entered and the quantification is set to "Exists".] 

Figure 3: Minimal query that finds documents containing the phrase "Wild West" 

In Figure 4 an expanded concrete query is depicted. This query was formulated by exploring 
the DTD presented in Section 2. It retrieves the title and description of Wild West movies in 
which Redford does not star as a villain. Intuitively, answering this query is a two-part process: 

1. Search for Wild West movies. The phrase "Wild West" may appear anywhere in the 
description of a movie. For example, it may appear in the title or in the movie description. 
Intuitively, this is similar to a search in a search engine. 

2. Query the movies to find those in which Redford does not play a villain. This condition 
is rather exact. It specifies exactly where the phrases should appear and it contains a 
quantification constraint. Thus, conceptually, this is similar to a traditional database 
query. 



3.2 Abstract Query Syntax 

We present an abstract syntax for EquiX and show how a concrete query is translated to an 
abstract query. 

Find each XML document that: 
    Has a movieInfo that 
        Has a movie that matches "Wild West" that 
            Has a descr and that 
            Has a title and that 
            Does not have a character that 
                Has a role that matches "villain" and that 
                Has a star that matches "Redford" and that 
                [Double-Click to Add Conditions] 

Figure 4: Query that finds titles and descriptions of movies in which Redford isn't a villain 

A boolean function that associates each sequence of alphanumeric symbols with a truth 
value among $\{\bot, \top\}$ is a string matching function. We assume that there is an infinite set $\mathcal{C}$ 
of string matching functions, that $\mathcal{C}$ is closed under complement and that the function $\top$ is a 
member of $\mathcal{C}$. We also assume that each function in $\mathcal{C}$ is computable in polynomial time. One 
such function might be: 

$$s \mapsto \begin{cases} \top & \text{if } s \text{ contains the words ``wild'' and ``west''} \\ \bot & \text{otherwise} \end{cases}$$
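
String matching functions of this kind can be sketched as ordinary predicates. In the Python fragment below, `True` and `False` stand for $\top$ and $\bot$; the helper names `contains_words`, `complement` and `top`, as well as the case-insensitive word tokenization, are assumptions made for the example.

```python
import re
from typing import Callable

StringMatcher = Callable[[str], bool]   # True plays the role of ⊤, False of ⊥


def contains_words(*words: str) -> StringMatcher:
    """Matcher that holds when every given word occurs in s (case-insensitive)."""
    def match(s: str) -> bool:
        tokens = set(re.findall(r"\w+", s.lower()))
        return all(w.lower() in tokens for w in words)
    return match


def complement(f: StringMatcher) -> StringMatcher:
    """The set of matchers is assumed to be closed under complement."""
    return lambda s: not f(s)


def top(s: str) -> bool:
    """The constant function ⊤, used for empty content constraints."""
    return True


wild_west = contains_words("wild", "west")
print(wild_west("Secrets of the Wild West"))
print(complement(wild_west)("The Lone Cowboy"))
```

Each matcher runs in time linear in the length of its input, in line with the polynomial-time assumption.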

We define an abstract query below. 

Definition 3.1 (Abstract Query) An abstract query is a rooted directed tree $T$ augmented by 
four constraining functions and an output set, denoted $Q = (T, l, c, o, q, O)$ where 

• $l : N \to \mathcal{L}$ is a labeling function that associates each node with a label; 

• $c : N \to \mathcal{C}$ is a content function that associates each node with a string matching function; 

• $o : N \to \{\wedge, \vee\}$ is an operator function that associates each node with a logical operator; 

• $q : E \to \{\exists, \forall\}$ is a quantification function that associates each edge with a quantifier; 

• $O \subseteq N$ is the set of projected nodes, i.e., nodes that should appear in the result. 

Consider a node $n$. If $o(n) = \wedge$, we will say that $n$ is an and-node. Otherwise we will say 
that $n$ is an or-node. Similarly, consider an edge $e$. If $q(e) = \exists$, we will say that $e$ is an 
existential-edge. Otherwise, $e$ is a universal-edge. 

We give an intuitive explanation of the meaning of an abstract query. The formal semantics is 
presented in Section 4. When evaluating a query, we will attempt to match nodes in a document 
to nodes in the query. In order for a document node $n_X$ to match a query node $n_Q$, the function 
$c(n_Q)$ should hold on the data below $n_X$. In addition, if $n_Q$ is an and-node (or-node), we require 
that each (at least one) child of $n_Q$ be matched to a child of $n_X$. If $n_X$ is matched to $n_Q$ then 
a child $n'_X$ of $n_X$ can be matched to a child $n'_Q$ of $n_Q$ only if the edge $(n_Q, n'_Q)$ can be satisfied 
w.r.t. $n_X$. Roughly speaking, in order for a universal-edge (existential-edge) to be satisfied w.r.t. 
$n_X$, all children (at least one child) of $n_X$ that have the same label as $n'_Q$ must be matched to $n'_Q$. 

Note that in a concrete query the user can use the quantifiers $\exists$, $\forall$, $\neg\forall$ and $\neg\exists$, and 
all nodes are implicitly and-nodes. In an abstract query only the quantifiers $\exists$ and $\forall$ may be 
used and the nodes may be either and-nodes or or-nodes. When creating a user interface for our 
language we found that the concrete query language was generally more intuitive for the user. 
We present the abstract query language to simplify the discussion of the semantics and query 
evaluation. Note that the two languages are equivalent in their expressive power. 

We address the problem of translating a concrete query to an abstract query. Most of 
this process is straightforward. The tree structure of the abstract query is determined by the 
structure of the concrete query. The labeling function $l$ is determined by the labels (i.e., element 
and attribute names) appearing in the concrete query. The set O is determined by the nodes 
marked for output by the user. 

Translating the quantification constraints is slightly more complicated. As a first step we 
augment each edge in the query with the appropriate quantifier as determined by the user. We 
associate each node with the $\wedge$-operator and with the content constraint specified by the user. 
Note that an empty content constraint in a concrete query corresponds to the boolean function 
$\top$. Next, we propagate the negation in the query. When negation is propagated through an 
and-node (or-node), the node becomes an or-node (and-node), and the string matching function 
associated with the node is replaced by its complement. Similarly, when negation is propagated 
through an existential-edge (universal-edge), the edge becomes a universal-edge (existential-edge). 
In this fashion, we derive a tree in which each edge is associated with $\exists$ or $\forall$ and 
each node is associated with $\wedge$ or $\vee$. The functions $o$, $q$, and $c$ are determined by the process 
described above. 
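
The negation-propagation step can be sketched as a single recursive pass. The Python code below is an illustration under stated assumptions rather than the authors' implementation: the classes `ConcreteNode` and `AbstractNode`, the symbolic string encoding of content constraints, and the helper `negate` are all hypothetical. A concrete edge carries a quantifier together with a negation flag, so that "not exists" is written as `("exists", True, child)`.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

DUAL = {"exists": "forall", "forall": "exists"}


@dataclass
class ConcreteNode:
    label: str
    content: str = "true"                       # content constraint, kept symbolic
    # Each edge: (quantifier, negated?, child)
    edges: List[Tuple[str, bool, "ConcreteNode"]] = field(default_factory=list)


@dataclass
class AbstractNode:
    label: str
    op: str                                     # "and" or "or"
    content: str
    edges: List[Tuple[str, "AbstractNode"]] = field(default_factory=list)


def negate(content: str) -> str:
    """Symbolic complement of a content constraint."""
    if content == "true":
        return "false"
    if content == "false":
        return "true"
    return content[4:-1] if content.startswith("not(") else f"not({content})"


def translate(node: ConcreteNode, negated: bool = False) -> AbstractNode:
    """Push negation downwards, flipping operators, quantifiers and contents."""
    result = AbstractNode(
        label=node.label,
        op="or" if negated else "and",
        content=negate(node.content) if negated else node.content,
    )
    for quant, edge_negated, child in node.edges:
        # A negated edge below a negated node cancels out.
        flip = negated != edge_negated
        result.edges.append((DUAL[quant] if flip else quant, translate(child, flip)))
    return result


# The movie part of the concrete query of Figure 4.
query = ConcreteNode("movie", '"Wild West"', [
    ("exists", False, ConcreteNode("descr")),
    ("exists", False, ConcreteNode("title")),
    ("exists", True, ConcreteNode("character", edges=[
        ("exists", False, ConcreteNode("role", '"villain"')),
        ("exists", False, ConcreteNode("star", '"Redford"')),
    ])),
])
abstract = translate(query)
quant, character = abstract.edges[2]
print(quant, character.op, character.content)
```

In this sketch the "does not have a character" edge becomes a universal-edge into an or-node whose empty content constraint is complemented to the constant false function, and the edges below it become universal as well.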

The concrete query in Figure 4 is represented by the abstract query in Figure 5. The string 
matching functions are specified in italics next to the corresponding nodes. Black nodes are 
output nodes. In the sequel, unless otherwise specified, the term query will refer to an abstract 
query. 

[Tree diagram of the abstract query over the nodes movie, actor, character, role and star, annotated with operators, quantifiers and string matching functions.] 

Figure 5: Abstract query for the concrete query in Figure 4. Output nodes are colored black. 

Recall the search language requirements we presented in Section 1. We postulated that in 
a search language, it should not be necessary for the user to specify the format of the result 
(Criterion 1). In EquiX, by defining the set $O$, the user only specifies what information she 
wants the result to include, and does not explicitly detail the format in which it should appear. 
We suggested that it is important for there to be pattern matching, quantification, and logical 
expressions for constraining data and meta-data (Criteria 2, 3, and 4). For data, these can 
all be specified using the content function $c$. For meta-data, the pattern to which the structure 
should be matched is specified by $T$ and $l$, the quantification is specified by $q$, and logical 
operators can be specified using $o$. The result of an EquiX query is a set of XML documents. 
In Section 6 we show how a DTD for the result documents can be computed. Thus, requerying 
of results is possible in EquiX (Criterion 5). In Section 5 we show that EquiX queries can be 
evaluated in polynomial time, and thus, EquiX meets Criterion 6. 

4 Search Query Semantics 

When describing the semantics of a query in a relational database language, such as SQL or 
Datalog, the term matching can be used. The result of evaluating a query is the set of all tuples 
that match the schemas mentioned in the query and satisfy the constraints. We describe the 
semantics of an EquiX query in a similar fashion. 

We first define when a node in a document matches a node in a query. Consider a document 
$X$, and a query $Q$. Suppose that the labeling function of $X$ is $l_X$ and the labeling function of 
$Q$ is $l_Q$. We say that a node $n_X$ in $X$ matches a node $n_Q$ in $Q$ if $l_X(n_X) = l_Q(n_Q)$. We denote 
the parent of a node $n$ by $p(n)$. We now define a matching of a document to a query. 

Definition 4.1 (Matching) Let $X = (T_X, l_X)$ be an XML document, with nodes $N_X$ and root 
$r_X$. Let $Q = (T_Q, l_Q, c, o, q, O)$ be a query tree with nodes $N_Q$ and root $r_Q$. A matching of $X$ to 
$Q$ is a function $\mu : N_Q \to 2^{N_X}$, such that the following hold 

1. Root Matching: $\mu(r_Q) = \{r_X\}$; 

2. Node Matching: if $n_X \in \mu(n_Q)$, then $n_X$ matches $n_Q$; 

3. Connectivity: if $n_X \in \mu(n_Q)$ and $n_X$ is not the root of $X$, then $p(n_X) \in \mu(p(n_Q))$. 

Note that Condition 1 requires that the root of the document is matched to the root of the 
query, Condition 2 ensures that matching nodes have the same label, and Condition 3 requires 
matchings to have a tree-like structure. 
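
The three conditions can be checked directly. The following Python sketch assumes a hypothetical encoding in which both the document and the query are given by a parent map and a label map, and a matching is a dictionary from query nodes to sets of document nodes; the function name `is_matching` is likewise an assumption of the example.

```python
from typing import Dict, Hashable, Optional, Set

Parent = Dict[Hashable, Optional[Hashable]]   # node -> parent (None for the root)
Labels = Dict[Hashable, str]
Matching = Dict[Hashable, Set[Hashable]]


def is_matching(mu: Matching, doc_parent: Parent, doc_label: Labels,
                q_parent: Parent, q_label: Labels) -> bool:
    doc_root = next(n for n, p in doc_parent.items() if p is None)
    q_root = next(n for n, p in q_parent.items() if p is None)
    # 1. Root matching: the query root is mapped exactly to the document root.
    if mu.get(q_root) != {doc_root}:
        return False
    for nq, nodes in mu.items():
        for nx in nodes:
            # 2. Node matching: labels must agree.
            if doc_label[nx] != q_label[nq]:
                return False
            # 3. Connectivity: the parent of nx is matched to the parent of nq.
            if nx != doc_root and doc_parent[nx] not in mu.get(q_parent[nq], set()):
                return False
    return True


# A fragment of a movie document, using node numbers for reference.
doc_parent = {0: None, 2: 0, 3: 0, 10: 2, 11: 2, 14: 3}
doc_label = {0: "movieInfo", 2: "movie", 3: "movie",
             10: "descr", 11: "title", 14: "descr"}
q_parent = {"movieInfo": None, "movie": "movieInfo",
            "descr": "movie", "title": "movie"}
q_label = {n: n for n in q_parent}

mu = {"movieInfo": {0}, "movie": {2}, "descr": {10}, "title": {11}}
broken = dict(mu, descr={14})   # descr 14 lies below movie 3, which is not matched

print(is_matching(mu, doc_parent, doc_label, q_parent, q_label))
print(is_matching(broken, doc_parent, doc_label, q_parent, q_label))
```

The second mapping violates the connectivity condition, since the parent of node 14 is not matched to the movie node of the query.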

We define when a matching of a document to a query is satisfying. We first present some 
auxiliary definitions. Consider an XML document $X = (T_X, l_X)$, where $T_X = (N_X, E_X, r_X)$. 
Consider a node $n_X$ in $T_X$. We differentiate between the textual content (i.e., data) contained 
below the node $n_X$, and the structural content (i.e., meta-data). When defining the textual 
content of a node, we take ID and IDREF values into consideration. We say that $n'_X$ is a child of 
$n_X$ if $(n_X, n'_X) \in E_X$. We say that $n'_X$ is an indirect child of $n_X$ if $n_X$ has an attribute of type 
IDREF with the same value as an attribute of type ID of $n'_X$. We denote the textual content of 
a node $n_X$ as $t(n_X)$, defined as follows: 

• If $n_X$ is an atomic node, then $t(n_X) = l_X(n_X)$; 

• Otherwise, $t(n_X)$ is the concatenation^2 of the content of its children and indirect children. 

We demonstrate the textual content of a node with an example. Recall the XML document 
depicted in Figure 2. The textual content of Node 9 is "villain 436 Jack Redford". Note that 
$t(24)$ includes the value "Jack Redford" since Node 5 is an indirect child of Node 24. 

We discuss when a quantification constraint is satisfied. Consider a document $X$, a query $Q$ 
and a matching $\mu$ of $X$ to $Q$. Let $n_X$ be a node in $X$ and let $e = (n_Q, n'_Q)$ be an edge in $Q$. We 
say $n_X$ satisfies $e$ with respect to $\mu$ if the following holds 

• If $e$ is an existential-edge then there is a child $n'_X$ of $n_X$ such that $n'_X$ matches $n'_Q$ and 
$n'_X \in \mu(n'_Q)$. 

• If $e$ is a universal-edge then for all children $n'_X$ of $n_X$, if $n'_X$ matches $n'_Q$, then $n'_X \in \mu(n'_Q)$. 

We define a satisfying matching of a document to a query. 

^2 Note that an XML document may be cyclic as a result of ID and IDREF attributes. We take a finite 
concatenation by taking each child into account only once. In addition, the order in which the concatenation is 
taken and the ability to differentiate between data that originated in different nodes may affect the satisfiability 
of a string matching function. This is a technical problem that is taken into consideration in the implementation, 
by adding an auxiliary dividing symbol to the data. We will not elaborate on this point any further. 

Definition 4.2 (Satisfying Matching) Let $X = (T_X, l_X)$ be an XML document, and let 
$Q = (T_Q, l_Q, c, o, q, O)$ be a query tree. Let $\mu$ be a matching of $X$ to $Q$. We say that $\mu$ is a satisfying 
matching of $X$ to $Q$ if for all nodes $n_Q$ in $Q$ and for all nodes $n_X \in \mu(n_Q)$ the following 
conditions hold 

1. if $n_Q$ is a leaf then $c(n_Q)(t(n_X)) = \top$, i.e., $n_X$ satisfies the string matching condition of 
$n_Q$; 

2. otherwise ($n_Q$ is not a leaf): 

(a) if $n_Q$ is an or-node then $n_X$ satisfies either $c(n_Q)$ or at least one edge incident from 
$n_Q$ with respect to $\mu$; 

(b) if $n_Q$ is an and-node then $n_X$ satisfies both $c(n_Q)$ and all edges that are incident from 
$n_Q$ with respect to $\mu$. 

Condition 1 implies that the leaves satisfy the content constraints in $Q$. Conditions 2a and 2b 
imply that X satisfies the quantification constraints in Q. The structural constraints are satisfied 
by the existence of a matching. 
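
The conditions for a satisfying matching can be sketched in Python as follows. The encoding is hypothetical: the document is given by a child map, a label map and a precomputed textual content `text` for each node, the query by a child map, a label map, content predicates, operators ("and"/"or") and edge quantifiers ("exists"/"forall"). The sketch assumes that `mu` is already known to be a matching.

```python
from typing import Hashable, Tuple

Edge = Tuple[Hashable, Hashable]


def satisfies_edge(nx, edge: Edge, mu, doc_children, doc_label, q_label, quant) -> bool:
    """Does document node nx satisfy the query edge with respect to mu?"""
    child_q = edge[1]
    # Children of nx that match the query child (same label).
    same_label = [c for c in doc_children.get(nx, []) if doc_label[c] == q_label[child_q]]
    in_mu = [c in mu.get(child_q, set()) for c in same_label]
    return any(in_mu) if quant[edge] == "exists" else all(in_mu)


def is_satisfying(mu, doc_children, doc_label, text,
                  q_children, q_label, content, op, quant) -> bool:
    """Check the conditions of a satisfying matching; mu is assumed to be a matching."""
    for nq, nodes in mu.items():
        edges = [(nq, c) for c in q_children.get(nq, [])]
        for nx in nodes:
            holds = content[nq](text[nx])
            if not edges:                      # Condition 1: nq is a leaf
                ok = holds
            else:                              # Conditions 2a and 2b
                sat = [satisfies_edge(nx, e, mu, doc_children, doc_label, q_label, quant)
                       for e in edges]
                ok = (holds or any(sat)) if op[nq] == "or" else (holds and all(sat))
            if not ok:
                return False
    return True


# Fragment of a movie document: movie 2 with title 11.
doc_children = {2: [11]}
doc_label = {2: "movie", 11: "title"}
text = {2: "takes place in the Wild West... The Lone Cowboy", 11: "The Lone Cowboy"}

q_children = {"movie": ["title"]}
q_label = {"movie": "movie", "title": "title"}
content = {"movie": lambda s: "wild west" in s.lower(), "title": lambda s: True}
op = {"movie": "and", "title": "and"}
quant = {("movie", "title"): "exists"}

good = {"movie": {2}, "title": {11}}
bad = {"movie": {2}, "title": set()}
print(is_satisfying(good, doc_children, doc_label, text, q_children, q_label, content, op, quant))
print(is_satisfying(bad, doc_children, doc_label, text, q_children, q_label, content, op, quant))
```

In the second mapping no title is matched, so the existential-edge from movie to title is not satisfied by node 2.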

Example 4.3 Recall the query in Figure 5 and the document in Figure 2. Two of the satisfying 
matchings of the document to the query are specified in the following table. There are additional 
matchings not shown here. 

Query Node    First matching    Second matching 
movieInfo     {0}               {0} 
movie         {2}               {3} 
descr         {10}              {14} 
title         {11}              {15} 
character     {12,13}           {16} 
role          {25,27}           {29} 
star          {26,28}           {30} 
actor         {4}               {5} 

Note that there is no satisfying matching that matches Node 1 to the movie node in the query 
because the universal quantification on the edge connecting movie and character cannot be 
satisfied. 

We presented several satisfying matchings of a document to a query. Let $\mu$ and $\mu'$ be 
matchings of a document $X$ to a query $Q$. We define the union of $\mu$ and $\mu'$ in the obvious way. 
Formally, given a query node $n_Q$, 

$$(\mu \cup \mu')(n_Q) := \mu(n_Q) \cup \mu'(n_Q)$$
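
As a brief illustration, the union can be computed node by node. The Python sketch below represents a matching as a dictionary from query nodes to sets of document nodes (an assumption of the example) and applies the definition to the two matchings listed in Example 4.3.

```python
from typing import Dict, Hashable, Set

Matching = Dict[Hashable, Set[int]]


def union(mu1: Matching, mu2: Matching) -> Matching:
    """Pointwise union: (mu1 ∪ mu2)(nq) = mu1(nq) ∪ mu2(nq)."""
    return {nq: mu1.get(nq, set()) | mu2.get(nq, set())
            for nq in mu1.keys() | mu2.keys()}


mu1 = {"movieInfo": {0}, "movie": {2}, "descr": {10}, "title": {11},
       "character": {12, 13}, "role": {25, 27}, "star": {26, 28}, "actor": {4}}
mu2 = {"movieInfo": {0}, "movie": {3}, "descr": {14}, "title": {15},
       "character": {16}, "role": {29}, "star": {30}, "actor": {5}}

mu = union(mu1, mu2)
print(sorted(mu["movie"]), sorted(mu["character"]))
```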

There may be an exponential number of satisfying matchings of a given document to a given 
query. Note, however, that the following proposition holds. 

Proposition 4.4 (Union of Matchings) Let $X$ be an XML document and let $Q$ be a query. 
Let $\mathcal{M}$ be the set of all satisfying matchings of $X$ to $Q$. Then the union of all the satisfying 
matchings in $\mathcal{M}$ is a satisfying matching. Formally, 

$$\Bigl(\bigcup_{\mu \in \mathcal{M}} \mu\Bigr) \in \mathcal{M}$$

Proof. It is sufficient to show that the union of any two satisfying matchings $\mu_1$, $\mu_2$ is a satisfying 
matching. Let $\mu := \mu_1 \cup \mu_2$. Suppose that $X = (T_X, l_X)$ with root $r_X$ and $Q = (T_Q, l_Q, c, o, q, O)$ 
with root $r_Q$. 

We first show that $\mu$ is a matching. It is easy to see that the root of the document is 
matched to the root of the query since $\mu(r_Q) = \mu_1(r_Q) \cup \mu_2(r_Q) = \{r_X\} \cup \{r_X\} = \{r_X\}$. Now 
consider $n_X \in \mu(n_Q)$ for some document node $n_X$ and query node $n_Q$. Then $n_X \in \mu_1(n_Q)$ or 
$n_X \in \mu_2(n_Q)$. In either case it follows that $n_X$ matches $n_Q$. Similarly, if $n_X$ is not the root of 
$X$ then it easily follows that $p(n_X) \in \mu(p(n_Q))$. Thus, $\mu$ is a matching. 

In a similar fashion it is easy to see that $\mu$ is a satisfying matching. This follows since 
satisfiability is checked for each node separately and thus satisfiability of $\mu$ follows directly from 
satisfiability of $\mu_1$ and $\mu_2$. □ 

We say that a document X satisfies a query Q if there exists a satisfying matching of X to 
Q. We now specify the output of evaluating a query on a single XML document. The result of a 
query is the set of documents derived by evaluating the query on each document in the queried 
catalog (i.e., each document that matched the DTD from which the query was derived). 

Intuitively, the result of evaluating a query on a document is a subtree of the document 
(as required in Criterion 1). The subtree contains nodes of three types. Document nodes 
corresponding to output query nodes appear in the resulting subtree. In addition, we include 
ancestors and descendants of these nodes. The ancestors ensure that the result has a tree-like 
structure and that it is a projection of the original document. Recall that the textual content 
of the document is contained in the atomic nodes of the document tree. Hence, the result must 
include the descendants to ensure that the textual content is returned. 

For a given document, query processing can be viewed as the process of singling out the 
nodes of the document tree that will be part of the output. Consider a document $X = (T_X, l_X)$ 
with $T_X = (N_X, E_X, r_X)$ and a query Q with projected nodes O. Let $\mathcal{M}$ be the set of satisfying 
matchings of X to Q. The output of evaluating the query Q on the document X is the 
document defined by projecting $N_X$ on the set $N_R := N_{out} \cup N_{anc} \cup N_{desc}$ defined as 

• $N_{out} := \{n_X \in N_X \mid (\exists n_O \in O)(\exists \mu \in \mathcal{M})\ n_X \in \mu(n_O)\}$, i.e., nodes in X corresponding 
to projected nodes in Q; 

• $N_{anc} := \{n_X \in N_X \mid (\exists n'_X \in N_{out})\ n_X \text{ is an ancestor of } n'_X\}$, i.e., ancestors of nodes in 
$N_{out}$; 

• $N_{desc} := \{n_X \in N_X \mid (\exists n'_X \in N_{out})\ n_X \text{ is a descendant of } n'_X\}$, i.e., descendants of 
nodes in $N_{out}$. 

We call $N_R$ the output set of X with respect to Q. 
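
As a minimal sketch of this definition, the following Python function assembles the output set from an explicit collection of satisfying matchings. It assumes that the document tree is given as a hypothetical `parent` dictionary (the root maps to `None`) and that each matching is a dictionary from query nodes to sets of document nodes. These representations and the node names in the example are illustrative assumptions, not part of EquiX.

```python
from typing import Dict, Hashable, Iterable, Optional, Set

Node = Hashable
Matching = Dict[Node, Set[Node]]  # query node -> matched document nodes


def output_set(parent: Dict[Node, Optional[Node]],
               matchings: Iterable[Matching],
               projected: Set[Node]) -> Set[Node]:
    """Return the union of N_out, N_anc and N_desc."""
    # N_out: document nodes matched to a projected query node by some matching.
    n_out: Set[Node] = set()
    for mu in matchings:
        for n_o in projected:
            n_out |= mu.get(n_o, set())

    # N_anc: walk upwards from every node in N_out.
    n_anc: Set[Node] = set()
    for n in n_out:
        p = parent[n]
        while p is not None:
            n_anc.add(p)
            p = parent[p]

    # N_desc: a node is a descendant if one of its proper ancestors is in N_out.
    n_desc: Set[Node] = set()
    for n in parent:
        p = parent[n]
        while p is not None:
            if p in n_out:
                n_desc.add(n)
                break
            p = parent[p]

    return n_out | n_anc | n_desc


# Hypothetical document: movieInfo -> movie1 -> {descr1 -> text1, title1, actor1}
parent = {"movieInfo": None, "movie1": "movieInfo", "descr1": "movie1",
          "text1": "descr1", "title1": "movie1", "actor1": "movie1"}
mu = {"q_movie": {"movie1"}, "q_descr": {"descr1"}, "q_title": {"title1"}}
result = output_set(parent, [mu], projected={"q_descr", "q_title"})
```

In this example the node `actor1` is left out of the result, since it is neither matched to a projected query node nor an ancestor or descendant of such a node. Enumerating all matchings explicitly may be exponential, which is why the evaluation algorithm of Section 5 avoids it.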

The result of applying the query in Figure ^ to the document in Figure 2 is depicted in 
Figure 6. Note that the values of "descr" and "title" are grouped by "movie". This follows 
naturally from the structure of the original document. 

Figure 6: Result Document (a tree rooted at movieInfo with two descr/title pairs: "takes place in the Wild West..." / "The Lone Cowboy" and "This movie..." / "Secrets of the Wild West") 

5 Query Evaluation 

Recall that a query is defined by choosing a catalog and exploring its DTD. Consider a query 
Q generated from a DTD d in the catalog (d, S). The result of evaluating Q on the database is 
the set of documents generated by evaluating Q on each document in S. 

We present an algorithm for evaluating a query on a document. There may be an exponential 
number of matchings of a query to a document. Concrete queries contain both quantification 
and negation. This would appear to be another source of complexity. Thus, it would seem that 
computing the output of a query on a document should be computationally expensive. Roughly 
speaking, however, query evaluation in this case is analogous to evaluating a first-order query 
that can be written using only two variables. Therefore, using dynamic programming (Cormen 
et al., 1990), we can in fact derive an algorithm that runs in polynomial time, even when the 
query is considered part of the input (i.e., combined complexity). Thus, EquiX meets the search 
language requirement of having polynomial evaluation time (Criterion 6). 

In Figure 7 we present a polynomial procedure that computes the output of a document, 
given a query. Given a document X and query Q, the procedure Query_Evaluate computes the 
output set $N_R$ of X w.r.t. Q. Note that the value of $t(n_X)$ for each document node $n_X$ can 
be computed in a preprocessing step in polynomial time. Query_Evaluate uses the procedure 
Matches shown in Figure 8. Given a query node $n_Q$ and a document node $n_X$, the procedure 
Matches checks if it is possible that $n_X \in \mu(n_Q)$ for some matching $\mu$, based on the subtrees of 
$n_Q$ and $n_X$. 

Note that $\mathit{path}(n)$ is the sequence of element names on the path from the root of the query 

Algorithm Query_Evaluate 

Input   Document X = (T_X, l_X) s.t. T_X = (N_X, E_X, r_X), 
        Query tree Q = (T_Q, l_Q, c, o, q, O) s.t. T_Q = (N_Q, E_Q, r_Q) 
Output  N_R ⊆ N_X, i.e., the outputted document nodes 

Initialize match_array[][] to false 
Queue1 := N_Q, ordered by descending depth 
While (not isEmpty(Queue1)) do 
    n_Q := Dequeue(Queue1) 
    For all n_X ∈ N_X such that path(n_X) = path(n_Q) do 
        match_array[n_Q][n_X] := Matches(n_Q, n_X, match_array) 

N_R := ∅ 
Queue2 := N_Q, ordered by ascending depth 
While (not isEmpty(Queue2)) do 
    n_Q := Dequeue(Queue2) 
    For all n_X ∈ N_X do 
        If (n_Q ≠ r_Q and not match_array[p(n_Q)][p(n_X)]) then 
            match_array[n_Q][n_X] := false 
        Else If (match_array[n_Q][n_X] and n_Q ∈ O) then 
            N_R := N_R ∪ {n_X} ∪ anc(n_X) ∪ desc(n_X) 
Return N_R 

Figure 7: Evaluation of an EquiX Query 

to $n$, and $\mathit{anc}(n)$ ($\mathit{desc}(n)$) is the set of ancestors (descendants) of $n$. Note also that $N_Q$ are the 
query nodes and $N_X$ are the document nodes. We use $|N_Q|$ and $|N_X|$ to denote the number of 
query and document nodes, respectively. The array match_array is an array of size $|N_Q| \times |N_X|$ of 
boolean values. Observe that in Figure 7 we order the nodes by descending depth. This ensures 
that when Matches($n_Q$, $n_X$, match_array) is called, the array match_array is already updated for 
all the children of $n_Q$ and $n_X$. The procedure Query_Evaluate does not explicitly create any 
matchings. However, the following theorem holds. 

Theorem 5.1 (Correctness of Query_Evaluate) Given a document X and a query Q, the al- 
gorithm Query_Evaluate computes the output set of X w.r.t. Q. 

In Appendix A we prove this theorem. It can be shown that the procedure Query_Evaluate 
runs in polynomial time in combined complexity. Let $|D|$ be the size of the data in document 
X, i.e., the size of X when ignoring X's meta-data. Formally, $|D| = |t(r_X)|$. Let $C(m)$ be an 
upper bound on the runtime of computing a string-matching constraint on a string of size $m$. 
Recall that $C(m)$ is polynomial in $m$. 

Procedure Matches(n_Q, n_X, match_array) 

Input   A query node n_Q 
        A document node n_X 
        An array match_array 
Output  true if n_X may be in μ(n_Q) for a matching μ, 
        based on the subtrees of n_X and n_Q, and false otherwise 

t_c := c(n_Q)(t(n_X)) 
If n_Q is a terminal node return t_c 
Let M_Q be the set of children of n_Q in Q 
For each m_Q ∈ M_Q do: 
    Let M_X be the set of children, m_X, of n_X in X such that l_X(m_X) = l_Q(m_Q) 
    If (n_Q, m_Q) is an existential-edge then 
        status(m_Q) := ⋁_{m_X ∈ M_X} match_array[m_Q][m_X] 
    Else status(m_Q) := ⋀_{m_X ∈ M_X} match_array[m_Q][m_X] 
If n_Q is an or-node then 
    return t_c ∨ (⋁_{m_Q ∈ M_Q} status(m_Q)) 
Else return t_c ∧ (⋀_{m_Q ∈ M_Q} status(m_Q)) 

Figure 8: Satisfaction of a Node Procedure 

Theorem 5.2 (Polynomial Complexity) Given a document X and a query Q, the algorithm 
Query_Evaluate runs in time $O(|N_X| \cdot |N_Q| \cdot (|N_Q| \cdot |N_X| + C(|D|)))$. 

Proof. The initialization stage, i.e., the sorting of Queue1, can be done in $O(|N_Q| \lg |N_Q|)$. The 
first "while" loop runs $O(|N_X||N_Q|)$ times and in each iteration calls the procedure Matches, 
which runs in time $O(|N_Q||N_X| + C(|D|))$. Once again, initialization of Queue2 can be done in 
$O(|N_Q| \lg |N_Q|)$. The second "while" loop runs in time $O(|N_X|^2|N_Q|)$. Therefore, the algorithm 
Query_Evaluate runs in time $O(2|N_Q| \lg |N_Q| + |N_X||N_Q|(|N_Q||N_X| + C(|D|)) + |N_X|^2|N_Q|)$, 
which is equal to $O(|N_X||N_Q|(|N_Q||N_X| + C(|D|)))$ as required. □ 
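
The two procedures of Figures 7 and 8 can be sketched in Python as follows. This is an illustrative reading of the pseudocode, not the authors' implementation: the classes `DocNode` and `QueryNode`, the helper `link`, and the choice to store the edge quantifier on the child node (`universal`) are assumptions made for the sketch, and $t(n_X)$ is recomputed on demand instead of being precomputed. The second movie in the example is invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Set, Tuple


@dataclass(eq=False)  # identity-based hashing, so nodes can be dict keys
class DocNode:
    label: str
    text: str = ""
    children: List["DocNode"] = field(default_factory=list)
    parent: Optional["DocNode"] = None


@dataclass(eq=False)
class QueryNode:
    label: str
    cond: Callable[[str], bool] = lambda s: True  # c(n_Q)
    is_or: bool = False       # o(n_Q): or-node if True, and-node otherwise
    universal: bool = False   # quantifier of the edge entering this node
    projected: bool = False   # True iff the node belongs to O
    children: List["QueryNode"] = field(default_factory=list)
    parent: Optional["QueryNode"] = None


def link(parent, *children):
    """Attach children to parent and return parent (tree-building helper)."""
    for ch in children:
        ch.parent = parent
        parent.children.append(ch)
    return parent


def subtree(n):
    yield n
    for ch in n.children:
        yield from subtree(ch)


def path(n) -> Tuple[str, ...]:
    labels = []
    while n is not None:
        labels.append(n.label)
        n = n.parent
    return tuple(reversed(labels))


def text_of(n: DocNode) -> str:
    """t(n_X): textual content of the subtree rooted at n."""
    return n.text + "".join(text_of(ch) for ch in n.children)


def matches(nq, nx, match) -> bool:
    tc = nq.cond(text_of(nx))
    if not nq.children:                      # terminal query node
        return tc
    status = []
    for mq in nq.children:
        flags = [match.get((mq, mx), False)
                 for mx in nx.children if mx.label == mq.label]
        status.append(all(flags) if mq.universal else any(flags))
    return (tc or any(status)) if nq.is_or else (tc and all(status))


def query_evaluate(doc_root, query_root) -> Set[DocNode]:
    doc_nodes = list(subtree(doc_root))
    query_nodes = sorted(subtree(query_root), key=lambda n: len(path(n)))
    match: Dict[tuple, bool] = {}
    for nq in reversed(query_nodes):         # first pass: descending depth
        for nx in doc_nodes:
            if path(nx) == path(nq):
                match[(nq, nx)] = matches(nq, nx, match)
    result: Set[DocNode] = set()
    for nq in query_nodes:                   # second pass: ascending depth
        for nx in doc_nodes:
            if not match.get((nq, nx), False):
                continue
            if nq is not query_root and not match.get((nq.parent, nx.parent), False):
                match[(nq, nx)] = False      # parent pair failed: drop this pair
            elif nq.projected:
                result.update(subtree(nx))   # n_X and desc(n_X)
                while nx is not None:        # anc(n_X)
                    result.add(nx)
                    nx = nx.parent
    return result


doc = link(DocNode("movieInfo"),
           link(DocNode("movie"), DocNode("descr", "takes place in the Wild West"),
                DocNode("title", "The Lone Cowboy")),
           link(DocNode("movie"), DocNode("descr", "a space opera"),
                DocNode("title", "Star Trip")))
query = link(QueryNode("movieInfo"),
             link(QueryNode("movie"),
                  QueryNode("descr", cond=lambda s: "Wild West" in s),
                  QueryNode("title", projected=True)))
result = query_evaluate(doc, query)
titles = sorted(n.text for n in result if n.label == "title")
```

The second title is matched in the bottom-up pass but discarded in the top-down pass, because its parent movie fails the condition on "descr". The sketch favours readability over the bound of Theorem 5.2: paths and text are recomputed on each call.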

6 Creating a Result DTD 

In Section 5 we described the process of evaluating a query on a database. Query evaluation 
generates a set of documents. A query is formed using a chosen DTD, called the originating 
DTD, and only documents strictly conforming to the originating DTD will be queried. Thus, 
in order to allow iterative querying or requerying of results, a DTD for the resulting documents 
must be defined. Given a query Q, if any possible result document must conform to the DTD 
$d_R$, we say that $d_R$ is a result DTD for Q. In this section we present a procedure that, given a 
query Q, computes in polynomial time a result DTD for Q. Thus, we show that EquiX fulfills 
the search language requirement of ability to perform requerying (Criterion 5). 

A DTD is a set of element definitions and attribute list definitions. An element definition 
has the form <!ELEMENT e $\varphi$>, where e is the element name being defined and $\varphi$ is its content 
definition. An attribute list definition has the form <!ATTLIST e $\psi_1 \ldots \psi_n$>, where e is an 
element name and $\psi_1, \ldots, \psi_n$ are definitions of attributes for e. The set of element names defined 
in a DTD d is its element name set, denoted $\mathcal{E}_d$. 

Consider a query $Q = (T, l, c, o, q, O)$ formulated from a DTD d. We say that element name 
e' is a descendant of element name e in d if e' may be nested within an element e in a document 
conforming to d. Formally, e' is a descendant of e if 

• e' appears in the content definition of e or 

• e' is a descendant of an element name e'' which appears in the content definition of e. 

We say that e is an ancestor of e' in d if e' is a descendant of e in d. Note that the element 
name e may appear in a document resulting from evaluating Q if there is a node $n_O \in O$ such 
that $l(n_O) = e$. Additionally, e may appear in the output if e is an ancestor or descendant in 
d of an element e' that meets the condition presented in the previous sentence. Thus, given a 
query, we can compute in linear time the element name set $\mathcal{E}_{d_R}$ of the result DTD $d_R$. 
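
The computation of the element name set can be sketched as follows. The sketch assumes a simplified, hypothetical view of a DTD as a dictionary `content` that maps each element name to the set of element names occurring in its content definition. It is written for clarity and recomputes descendant sets, so it does not attain the linear bound stated above.

```python
from typing import Dict, Set


def dtd_descendants(content: Dict[str, Set[str]], e: str) -> Set[str]:
    """Element names that may be nested (at any depth) inside e."""
    seen: Set[str] = set()
    stack = list(content.get(e, ()))
    while stack:
        name = stack.pop()
        if name not in seen:
            seen.add(name)
            stack.extend(content.get(name, ()))
    return seen


def result_element_names(content: Dict[str, Set[str]],
                         projected_labels: Set[str]) -> Set[str]:
    """Labels of projected query nodes plus their DTD ancestors and descendants."""
    names = set(projected_labels)
    for e in projected_labels:                 # descendants of projected names
        names |= dtd_descendants(content, e)
    for a in content:                          # ancestors of projected names
        if dtd_descendants(content, a) & projected_labels:
            names.add(a)
    return names


# Hypothetical originating DTD, reduced to "which names occur inside which".
content = {"movieInfo": {"movie"}, "movie": {"title", "descr", "actor"},
           "actor": {"name"}, "title": set(), "descr": set(), "name": set()}
names = result_element_names(content, {"actor"})
```

With "actor" as the only projected label, the names "title" and "descr" are excluded, since they are neither ancestors nor descendants of "actor" in this DTD.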

In order to compute the result DTD of a query Q, we must compute the content definitions 
and attribute list definitions for the elements in $\mathcal{E}_{d_R}$. In the result DTD, we take the attribute 
list definitions for the elements in $\mathcal{E}_{d_R}$ as defined in the originating DTD but change all attributes 
to be of type #IMPLIED. Note that the root element name r of the originating DTD will always 
be in $\mathcal{E}_{d_R}$. It is easy to see that r is also the designated root element name of $d_R$. 

In Figure 9 we present an algorithm that computes the content definition for an element in 
$\mathcal{E}_{d_R}$. Intuitively, any elements that will not appear in the result of a query must be removed 
from the original DTD in order to form the result DTD. In addition, elements will only appear 
in result documents if query constraints are satisfied. Thus, this possible appearance of elements 
may be taken into account when formulating $d_R$. The algorithm Create_Content_Definition uses 
the procedure presented in Figure 10 in order to simplify the content definition it creates. The 
result DTD is created by computing the content definitions for all $e \in \mathcal{E}_{d_R}$ and adding the 
attribute list definitions. Note that in the algorithm, dtd-desc(e', e) is true if e is a descendant 
of e' in the DTD d. In addition, anc($n_Q$, O) is true if $n_Q$ is an ancestor of some node in O. 



Algorithm Create_Content_Definition 

Input   An element e ∈ E_{d_R} 
        A query Q with nodes N_Q, edges E_Q and projected nodes O 
        The originating DTD d with content definition φ_e for e 
Output  The content definition of e in the result DTD 

If (∃n_O ∈ O) s.t. ((l(n_O) = e) or 
        (l(n_O) = e' and dtd-desc(e', e))) then 
    φ := φ_e 
Else φ := ∅ 
For each (n_Q ∈ N_Q) s.t. ((l(n_Q) = e) and (anc(n_Q, O))) do 
    ψ' := φ_e 
    For all elements e' in ψ' do 
        If (∃n'_Q ∈ N_Q) s.t. path(n'_Q) = path(n_Q) and 
           (∃n''_Q ∈ N_Q) s.t. (n'_Q, n''_Q) ∈ E_Q and l(n''_Q) = e' and (anc(n''_Q, O)) then 
            Replace all occurrences of e' in ψ' with (e'?) 
        Else Replace all occurrences of e' in ψ' with ∅ 
    φ := φ | ψ' 
Return Simplify(φ). 

Figure 9: Content Definition Generation Algorithm 

Procedure Simplify(φ) 

Input   Content definition φ 
Output  Simplified content definition of φ 

While there is a change in φ 
    Apply the following rules to φ or its subexpressions 
        1. (t, ∅) ⇒ t          5. (∅)? ⇒ ∅ 
        2. (∅, t) ⇒ t          6. (∅)* ⇒ ∅ 
        3. (t | ∅) ⇒ t?        7. (∅)+ ⇒ ∅ 
        4. (∅ | t) ⇒ t? 
If φ = ∅ then Return EMPTY 
Else Return φ 

Figure 10: Definition Simplifying Algorithm 

Theorem 6.1 (Correctness of DTD Creation) Let Q be a query with an originating DTD 
d and let X be a document. Suppose that the result of evaluating Q on X is the XML document 
R. Then R strictly conforms to the result DTD as formed by the process described above. In 
addition, the computation of the result DTD can be performed in time $O(|d||Q|)$. 

Proof. We first prove correctness. Consider a specific occurrence of a document node $n_R$ 
with an element name of e appearing in a result document. Clearly, an element with name e can 
appear in a result document only if $e \in \mathcal{E}_{d_R}$. Thus, e has a content definition in the result DTD. 
The content definition of e is a disjunction of content definitions. It is sufficient to show that 
one of the definitions in the disjunction is satisfied with respect to the children of $n_R$. There are 
three possible causes for this occurrence of $n_R$ in the result document: 

1. There is a matching $\mu$ such that $n_R \in \mu(n_Q)$ for some output node $n_Q$ in the query. 
Thus, there is an output node $n_Q$ in the query with label e. Therefore, according to the 
algorithm, we take the original definition of e as one of the disjuncts of the new definition 
of e. Note that in this case, all of $n_R$'s children will appear in the result. Thus, the children 
of $n_R$ satisfy the definition of e in the result DTD. 

2. The node $n_R$ is a descendant of a node matched to an output query node and Case 1 does 
not hold. This case can be proved in the same manner as the previous case. 






3. The node $n_R$ is matched to a query node $n_Q$ that is an ancestor of an output node and 
Cases 1 and 2 do not hold. Note that it is possible that some of $n_R$'s children in the 
document do not appear in the result. Specifically, a child of $n_R$ with element name e' 
cannot appear in the result if there is no query node $n'_Q$ with the same path from the root as 
$n_Q$ and with a child labeled with e' that is an ancestor of an output node. (This easily 
follows from the definition of the output of a query.) In the content definition that we 
create for e according to $n_Q$ these elements are replaced by the empty element since they 
cannot appear in the output. All other elements are made optional by the addition of the 
"?" symbol. Thus, clearly the content definition defined according to $n_Q$ will be satisfied 
by the children of $n_R$. 

Thus, the algorithm is correct. 

The algorithm Create_Content_Definition is called at most once for each element 
name e. It then goes over the nodes in the query with 
label e. For each such node, a content definition is created which is of size $O(|d|)$. Thus, when 
amortizing the cost of the creation over all the query nodes, we derive that the result DTD can 
be created in time $O(|d||Q| + |d|) = O(|d||Q|)$. □ 
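
The Simplify procedure of Figure 10 can be sketched in Python as a bottom-up rewriting of an expression tree. The encoding is an assumption made for illustration: element names are strings, operators are tuples such as `("seq", a, b)`, `("alt", a, b)`, `("opt", a)`, `("star", a)` and `("plus", a)`, and the hypothetical constant `NOTHING` stands for the placeholder ∅. The sketch also assumes that rules 5 to 7 rewrite to ∅, which is the reading suggested by the final test of the procedure.

```python
NOTHING = ("nothing",)  # stands for the placeholder written as ∅ in Figure 10


def simplify_once(phi):
    """Apply the rewriting rules once, bottom-up, to phi and its subexpressions."""
    if isinstance(phi, str) or phi == NOTHING:
        return phi
    op, *args = phi
    args = [simplify_once(a) for a in args]
    if op == "seq":
        left, right = args
        if right == NOTHING:
            return left                      # rule 1: (t, ∅) => t
        if left == NOTHING:
            return right                     # rule 2: (∅, t) => t
    elif op == "alt":
        left, right = args
        if right == NOTHING:
            return ("opt", left)             # rule 3: (t | ∅) => t?
        if left == NOTHING:
            return ("opt", right)            # rule 4: (∅ | t) => t?
    elif op in ("opt", "star", "plus"):
        if args[0] == NOTHING:
            return NOTHING                   # rules 5-7 (assumed to yield ∅)
    return (op, *args)


def simplify(phi):
    """Rewrite until nothing changes; a definition equal to ∅ becomes EMPTY."""
    while True:
        new = simplify_once(phi)
        if new == phi:
            return "EMPTY" if phi == NOTHING else phi
        phi = new


# (title?, (∅, (∅)*)) collapses to title?
example = simplify(("seq", ("opt", "title"), ("seq", NOTHING, ("star", NOTHING))))
```

Because each rule strictly shrinks the expression or leaves it unchanged, the loop terminates after a number of passes bounded by the size of the content definition.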



Note that it follows from Theorem 6.1 that the result DTD is polynomial in the size of the 
original DTD and the query. The compactness of the result DTD makes the requerying process 
simpler, since requerying entails exploring the result DTD. 



According to Theorem 6.1, the resulting documents conform to the result DTD. The question 
arises as to how precisely the result DTD describes the resulting documents. In order to answer 
this question we define a partial order on DTDs (Papakonstantinou and Velikhov, 1999). Given 
a DTD d we denote the set of XML documents that strictly conform to d as $\mathit{conf}(d)$. Given 
DTDs d and d' we say that d is tighter than d', denoted $d \preceq d'$, if $\mathit{conf}(d) \subseteq \mathit{conf}(d')$. We say 
that d is strictly tighter than d', denoted $d \prec d'$, if $\mathit{conf}(d) \subset \mathit{conf}(d')$. 

Intuitively, it would be desirable to find a result DTD $d_R$ that is as tight as possible, under 
the restriction that all possible result documents must strictly conform to $d_R$. However, 
our algorithm does not necessarily find the tightest possible result DTD. In other words, our 
algorithm may create a result DTD $d_R$ although there exists a DTD $d'_R$ to which all resulting 
documents must strictly conform and $d'_R \prec d_R$. If $d_R$ is the tightest possible result DTD, we 
call $d_R$ a minimal result DTD. A minimal result DTD may not be unique. For a comprehensive 
discussion, see (Papakonstantinou and Velikhov, 1999). According to the following proposition, 
there is a query and DTD for which a minimal result DTD must be exponential in the size of 
the original DTD. 

Proposition 6.2 (Exponential Result DTD) There is a query Q created from an originating 
DTD d, such that if d' is a minimal result DTD of Q, then d' is of size $O(|d|!)$. 

Proof. Consider a DTD d with the root element r. Suppose that d contains the element definition 

<!ELEMENT r ($a_1, \ldots, a_k$)*> 

Let $Q = (T_Q, l_Q, c, o, q, O)$ be a query with root $n_r$ and children $n_1, \ldots, n_k$. Suppose that 
$l(n_r) = r$ and $l(n_i) = a_i$ for all $i$. We assume that all the nodes are and-nodes and all the edges 
are existential-edges. Suppose in addition that $O = \{n_1, \ldots, n_k\}$ and c maps each node to an 
arbitrary condition. We can conclude that each of the element names $a_1, \ldots, a_k$ will appear at 
least once in any result document. However, these element names can appear in any order. 
Thus, a minimal result DTD must consider $k!$ different orderings of the elements, proving the 
claim. □ 
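
To make the factorial growth concrete, the following small sketch enumerates one sequence per ordering of the element names. It only counts orderings; it is not a construction of a minimal result DTD, and the function name `ordering_alternatives` and the element names are hypothetical.

```python
from itertools import permutations
from math import factorial


def ordering_alternatives(names):
    """One comma-separated sequence per ordering of the given element names."""
    return ["(" + ", ".join(p) + ")" for p in permutations(names)]


alts = ordering_alternatives(["a1", "a2", "a3"])
k5 = len(ordering_alternatives([f"a{i}" for i in range(1, 6)]))
```

For three element names there are already six orderings, and for five there are 120, so a content definition that lists the orderings explicitly grows factorially in the number of element names.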

Observe that an exponential blowup of the original DTD is undesirable for two reasons. 
First, creating such a DTD is intractable. Second, if the result DTD is of exponential size, then 
it is difficult for a user to requery previous results. Thus, our algorithm for creating result DTDs 
actually returns a convenient DTD, although it is not always minimal. 



Footnote to the proof of Proposition 6.2: The reader should recall that the order of the document nodes defines the order of the result document nodes, 
while the order of the query nodes has no influence on the result. The need for an exponential size DTD hinges 
on this fact. 






7 Extending EquiX Queries 

EquiX can be extended in many ways to yield a more powerful language. In this section we 
present two extensions to the EquiX language. These extensions add additional querying ability 
to EquiX. After extending EquiX, the search language requirements other than simplicity of use 
are still met. It is, however, a matter of opinion whether the extended language still fulfills the 
criterion requiring simplicity of use. Thus, these extensions are perhaps more suitable for expert users. 

7.1 Adding Aggregation Functions and Constraints 

We extend the EquiX language to allow computing of aggregation functions and verification of 
aggregation constraints. We call the new language EquiX$^{agg}$. 

We extend the abstract query formalism to allow aggregation. An aggregation function is 
one of {count, min, max, sum, avg}. An atomic aggregation constraint has the form $f\,\theta\,v$ where 
$f$ is an aggregation function, $\theta \in \{<, \le, =, \ne, >, \ge\}$, and $v$ is a constant value. An aggregation 
constraint is a (possibly empty) conjunction of atomic aggregation constraints. In EquiX$^{agg}$ a 
query is a tuple $(T, l, c, o, q, a, a_c, O)$ as in EquiX, augmented with the following functions: 

• $a$ is an aggregation specifying function that associates each node with a (possibly empty) 
set of aggregation functions; 

• $a_c$ is an aggregation constraint function that associates each node with an aggregation 
constraint. 

Given a node $n_Q$, the aggregation specifying functions $a(n_Q)$ (and similarly the aggregation 
constraint functions) are applied to $t(n_X)$ for all $n_X \in \mu(n_Q)$. Note that the min and max 
functions can only be applied to an argument whose domain is ordered. Similarly, the aggregation 
functions sum and avg can only be applied to sets of numbers. There is no way to enforce such 
typing using a DTD (although it is possible using an XML Schema (Consortium)). When a 
function is applied to an argument that does not meet its requirement, its result is undefined. 

The function $a$ adds the computed aggregation values to the output. This is similar to 
placing an aggregation function in the SELECT clause of an SQL query. The function $a_c$ is used 
to further constrain a query. This is similar to the HAVING clause of an SQL query. 

In order to use an aggregation function in an SQL query, one must include a GROUP-BY 
clause. This clause specifies on which variables the grouping should be performed, when 
computing the result. To simplify the querying process, we do not require the user to specify 
the GROUP-BY variables. EquiX$^{agg}$ uses a simple heuristic rule to determine the grouping 
variables. Suppose that $a(n_Q) \neq \emptyset$ for some node $n_Q$. Let $n_O$ be the lowest node above $n_Q$ in 
Q for which one of the following conditions holds: 

• $n_O \in O$ or 

• $n_O$ is an ancestor of a node in O. 

Note that $n_O$ is the lowest node above $n_Q$ where both textual content and aggregate values may 
be combined. EquiX$^{agg}$ groups by $n_O$ when computing the aggregation functions on $n_Q$. In a 
similar fashion, EquiX$^{agg}$ performs grouping in order to compute aggregation constraints. 
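
The grouping heuristic can be sketched as a walk from the aggregated node towards the root of the query. The representation is an assumption for illustration: the query tree is a hypothetical `parent` dictionary over node names (the root maps to `None`), and "above" is read as "proper ancestor of".

```python
from typing import Dict, Hashable, Optional, Set

Node = Hashable


def grouping_node(nq: Node,
                  parent: Dict[Node, Optional[Node]],
                  projected: Set[Node]) -> Optional[Node]:
    """Lowest node above nq that is in O or is an ancestor of a node in O."""
    # Collect every proper ancestor of a projected node.
    anc_of_projected: Set[Node] = set()
    for n in projected:
        p = parent[n]
        while p is not None:
            anc_of_projected.add(p)
            p = parent[p]
    # Walk upwards from nq and stop at the first qualifying node.
    n = parent[nq]
    while n is not None:
        if n in projected or n in anc_of_projected:
            return n
        n = parent[n]
    return None


# Hypothetical query tree: movieInfo -> movie -> {title, actor -> salary}
parent = {"movieInfo": None, "movie": "movieInfo", "title": "movie",
          "actor": "movie", "salary": "actor"}
by_movie = grouping_node("salary", parent, projected={"title"})
by_actor = grouping_node("salary", parent, projected={"title", "actor"})
```

If only "title" is projected, an aggregate over "salary" is grouped by "movie". Once "actor" is projected as well, the grouping node moves down to "actor".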

Our choice for grouping variables is natural since it takes advantage of the tree structure 
of the query, and thus, suggests a polynomial evaluation algorithm. It is easy to see that 
adding aggregation functions and constraints does not affect the polynomiality of the evaluation 
algorithm. The algorithm for creating a result DTD must also be slightly adapted in order to 
take into consideration the aggregation values that are retrieved. Thus, EquiX$^{agg}$ still meets the 
search language requirements (with the caveat regarding simplicity of use noted at the start of this section). 

7.2 Querying Ontologies using Regular Expressions 

In EquiX, the user chooses a catalog and queries only documents in the chosen catalog. It 
is possible that the user would like to query documents conforming to different DTDs, but 
containing information about the same subject. In EquiX$^{reg}$ this ability is given to the user. 
Thus, EquiX$^{reg}$ is useful for information integration. 

An ontology, denoted $\mathcal{O}$, is a set of terms whose meanings are well known. Note that 
an ontology can be implemented using XML Namespaces (Bray et al., 1999). We say that a 
document X can be described by $\mathcal{O}$ if some of the element and attribute names in X appear 
in $\mathcal{O}$. When formulating a query, the user chooses an ontology of terms that describes the 
subject matter that she is interested in querying. Documents that can be described by the 
chosen ontology will be queried. A query tree in EquiX$^{reg}$ is a tuple $(T, l, c, o, q, O)$ as in EquiX. 
However, in EquiX$^{reg}$, $l$ is a function from the set of nodes to $\mathcal{O}$. 

Semantically, an EquiX$^{reg}$ query is interpreted in a different fashion from an EquiX query. 
Each edge is implicitly labeled with the "+" symbol. Intuitively, an edge in a query corresponds 
to a sequence of one or more edges in a document. We adapt the definition of satisfaction of an 
edge (presented in Section 4) to reflect this change. 

Consider a document X, a query Q and a matching $\mu$ of X to Q. Let $n_X$ be a node in X 
and let $e = (n_Q, n'_Q)$ be an edge in Q. We say $n_X$ satisfies e with respect to $\mu$ if the following 
holds: 

• If e is an existential-edge then there is a descendant $n'_X$ of $n_X$ such that $n'_X$ matches $n'_Q$ 
and $n'_X \in \mu(n'_Q)$. 

• If e is a universal-edge then for all descendants $n'_X$ of $n_X$, if $n'_X$ matches $n'_Q$, then $n'_X \in \mu(n'_Q)$. 

Note that the only change in the definition was to replace the words child and children with 
descendant and descendants. 
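
The descendant-based reading of edge satisfaction can be sketched as follows. The representation is assumed for illustration: the document is a hypothetical `children` dictionary, the matching `mu` maps query nodes to sets of document nodes, and the predicate `node_matches` stands in for "matches", whose precise definition is given with the definition of a matching. For brevity, the query node in the example is identified by its label.

```python
from typing import Callable, Dict, Hashable, Iterator, List, Set

Node = Hashable


def descendants(nx: Node, children: Dict[Node, List[Node]]) -> Iterator[Node]:
    """Yield all proper descendants of nx."""
    stack = list(children.get(nx, ()))
    while stack:
        n = stack.pop()
        yield n
        stack.extend(children.get(n, ()))


def satisfies_edge(nx: Node, child_q: Node, universal: bool,
                   mu: Dict[Node, Set[Node]],
                   children: Dict[Node, List[Node]],
                   node_matches: Callable[[Node, Node], bool]) -> bool:
    """Check whether nx satisfies the query edge leading to child_q w.r.t. mu."""
    candidates = [d for d in descendants(nx, children) if node_matches(d, child_q)]
    matched = mu.get(child_q, set())
    if universal:
        return all(d in matched for d in candidates)
    return any(d in matched for d in candidates)


# Hypothetical document: catalog -> section -> {movie1 -> title1, movie2 -> title2}
children = {"catalog": ["section"], "section": ["movie1", "movie2"],
            "movie1": ["title1"], "movie2": ["title2"]}
label = {"title1": "title", "title2": "title"}
same_label = lambda d, q: label.get(d) == q
mu = {"title": {"title1"}}
exists_ok = satisfies_edge("catalog", "title", False, mu, children, same_label)
forall_ok = satisfies_edge("catalog", "title", True, mu, children, same_label)
```

Under the child-based definition the edge from "catalog" to "title" could not be satisfied here, since no title is a child of the catalog node. Under the descendant-based definition the existential edge is satisfied, while the universal edge is not, because `title2` matches but is not in the matching.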

In a straightforward fashion, we can modify the query evaluation algorithm to reflect the 
new semantics presented. The new algorithm remains polynomial under combined complexity. 
Note that it is no longer possible to create a result DTD if we do not permit a result DTD to 
contain content definitions of type ANY. This results from the possible diversity of the documents 
being queried. However, EquiX$^{reg}$ still meets Criterion 5 (i.e., ability to requery results), since 
the resulting documents can be described by the chosen ontology. Thus, EquiX$^{reg}$ still meets the 
search language requirements (with the caveat regarding simplicity of use noted at the start of this section). 

8 Conclusion 

In this paper we presented design criteria for a search language. We defined a specific search 
language for XML, namely EquiX, that fulfills these requirements. Both a user-friendly con- 
crete syntax and a formal abstract syntax were presented. We defined an evaluation algorithm 
for EquiX queries that is polynomial even under combined complexity. We also presented a 
polynomial algorithm that generates a DTD for the result documents of an EquiX query. 

We believe that EquiX enables the user to search for information in a repository of XML 
documents in a simple and intuitive fashion. Thus, our language is especially suitable for use 
in the context of the Internet. EquiX has the ability to express complex queries with negation, 
quantification and logical expressions. We have also extended EquiX to allow aggregation and 
limited regular expressions. To summarize, EquiX is unique in its being both a powerful language 
and a polynomial language. 

Several XML query languages have been proposed recently, such as XML-QL (Deutsch et al., 
1998), XQL (Robie et al., 1998) and Lorel (Goldman et al., 1999). These languages are powerful 
in their querying ability. However, they do not fulfill some of our search language requirements. 
In these languages, the user can perform restructuring of the result. Thus, the format of the 
result must be specified, in contradiction to Criterion 1. Furthermore, XML-QL and XQL are 
limited in their ability to express quantification constraints (the quantification criterion). Most importantly, none 
of these languages guarantee polynomial evaluation under combined complexity (Criterion 6). 

As future work, we plan to extend the ability of querying ontologies and to allow more 
complex regular expressions in EquiX. XML documents represent data that may not have a strict 
schema. In addition, search queries constitute a guess of the content of the desired documents. 
Thus, we plan to refine EquiX with the ability to deal with incomplete information (Kanza 
et al., 1999) and with documents that "approximately satisfy" a query. Search engines perform 
an important service for the user by sorting the results according to their quality. We plan on 
experimenting to find a metric for ordering results that takes both the data and the meta-data 
into consideration. 

As the World-Wide Web grows, it is becoming increasingly difficult for users to find desired 
information. The addition of meta-data to the Web provides the ability to both search and query 
the Web. Enabling users to formulate powerful queries in a simple fashion is an interesting and 
challenging problem. 

A Correctness of Query_Evaluate 

In this Section we prove the correctness of the algorithm Query_Evaluate presented in Figure 7. 
We first prove some necessary lemmas. 

Lemma A.1 Let $n_X$ be a node in a document X and let $n_Q$ be a node in a query Q. If there 
exists a matching $\mu$ of X to Q such that $n_X \in \mu(n_Q)$ then $\mathit{path}(n_X) = \mathit{path}(n_Q)$. 

Proof. Suppose that $n_X \in \mu(n_Q)$. We show by induction on the depth of $n_Q$ that $\mathit{path}(n_X) = 
\mathit{path}(n_Q)$. Suppose that the labeling function of Q is $l_Q$ and that the labeling function of X is $l_X$. 

Case 1: Suppose that $n_Q$ is the root of Q. According to Condition 2 in Definition 4.1 it 
holds that $l_Q(n_Q) = l_X(n_X)$. According to the roots-match condition in Definition 4.1, $n_X$ is the root of X. 
Thus, clearly the claim holds. 

Case 2: Suppose that $n_Q$ is of depth $m$. Once again, according to Condition 2 in 
Definition 4.1, it holds that $l_Q(n_Q) = l_X(n_X)$. In addition, $p(n_X) \in \mu(p(n_Q))$ (see the connectivity condition 
in Definition 4.1). Note that $p(n_Q)$ is of depth $m - 1$. Thus, by the induction hypothesis, 
$\mathit{path}(p(n_X)) = \mathit{path}(p(n_Q))$. Our claim follows. □ 

We present an auxiliary definition. We define the height of a query node $n_Q$, denoted $h(n_Q)$, 
as 

$$h(n_Q) = \begin{cases} 0 & \text{if } n_Q \text{ is a leaf node} \\ \max\{h(n'_Q) \mid n'_Q \text{ is a child of } n_Q\} + 1 & \text{otherwise} \end{cases}$$

We show that the algorithm implicitly defines a satisfying matching $\mu_R$. The nodes that are 
returned are those corresponding to output nodes in $\mu_R$, and their ancestors and descendants. 
We define the function $\mu_R : N_Q \to 2^{N_X}$ in the following way: 

$$n_X \in \mu_R(n_Q) \iff \text{match\_array}[n_Q, n_X] = \text{"true"}$$

Note that we consider the values of match_array at the end of the evaluation of Query_Evaluate. 
We call $\mu_R$ the retrieval function of Query_Evaluate w.r.t. X and Q. 

Lemma A.2 (Retrieval Function is a Satisfying Matching) Let X be a document, let Q 
be a query and let $\mu_R$ be the retrieval function of Query_Evaluate w.r.t. X and Q. Then $\mu_R$ is a 
satisfying matching of X to Q. 

Proof. We show that $\mu_R$ is a matching, i.e., that $\mu_R$ meets the conditions in Definition 4.1. 

Roots Match: The only node that has the same path as $r_Q$ is $r_X$. Thus, the only 
time that the function Matches is called for the root of the query is with the root of the 
document. Thus, the value of $\mu_R(r_Q)$ must be either $\{r_X\}$ or $\emptyset$. However, it is 
easy to see that if $\mu_R(r_Q) = \emptyset$ then Query_Evaluate returns $\emptyset$. Thus, it must hold that 
$\mu_R(r_Q) = \{r_X\}$. 

Node Matching: If $n_X \in \mu_R(n_Q)$ then Matches was called with $n_Q$ and $n_X$. Thus $n_X$ 
and $n_Q$ have the same path, and hence, $n_X$ matches $n_Q$. 

Connectivity: If match_array[$p(n_Q)$, $p(n_X)$] does not hold, then match_array[$n_Q$, $n_X$] is 
assigned the value "false". Therefore, clearly the connectivity requirement of a matching 
holds. 



We now show that $\mu_R$ is a satisfying matching (see Definition 4.2). Suppose that $n_X \in 
\mu_R(n_Q)$ for a document node $n_X$ and a query node $n_Q$. Note that this implies that Matches 
returned the value "true" when applied to $n_Q$ and $n_X$. We prove by induction on the height of 
$n_Q$ that the appropriate condition holds. We consider three cases. 

• Suppose that $h(n_Q) = 0$. Then $n_Q$ is a leaf and $c(n_Q)(t(n_X))$ must hold as required. 

• Suppose that $h(n_Q) > 0$ and that $n_Q$ is an or-node. The procedure Matches returned the 
value "true" when applied to $n_Q$ and $n_X$. Therefore, one of the following must hold: 

1. The condition $c(n_Q)(t(n_X)) = \top$ holds. 

2. At the time of application of Matches to $n_Q$ and $n_X$, there was a child $m_Q$ of $n_Q$ 
and a child $m_X$ of $n_X$ such that match_array[$m_Q$, $m_X$] had the value "true". Note 
that it follows that $m_X$ matches $m_Q$. The value of match_array[$m_Q$, $m_X$] will not be 
changed to "false" during the evaluation of Query_Evaluate since match_array[$n_Q$, $n_X$] 
is "true". Thus, $m_X \in \mu_R(m_Q)$. 

In either case Condition 2a from Definition 4.2 holds as required. 

• Suppose that $h(n_Q) > 0$ and that $n_Q$ is an and-node. We omit the proof as it is similar to 
the previous case. 

Thus, $\mu_R$ is a satisfying matching as required. □ 

We say that a matching $\mu$ contains a matching $\mu'$ if $\mu(n_Q) \supseteq \mu'(n_Q)$ for all query nodes $n_Q$. 
We will show that the retrieval function contains all other satisfying matchings. 

Lemma A.3 (Retrieval Function Containment) Let X be a document, let Q be a query 
and let $\mu_R$ be the retrieval function of Query_Evaluate w.r.t. X and Q. Suppose that $\mu$ is a 
satisfying matching of X to Q. Then $\mu_R$ contains $\mu$. 

Proof. Suppose that $\mu$ is a satisfying matching. We show by induction on the height of $n_Q$ that 
$\mu(n_Q) \subseteq \mu_R(n_Q)$. We first consider the values assigned to match_array during the first pass (the 
bottom-up pass) of the algorithm. 

• Suppose that $h(n_Q) = 0$. Suppose that $n_X \in \mu(n_Q)$. Then, according to Lemma A.1, 
$\mathit{path}(n_X) = \mathit{path}(n_Q)$. Thus, Matches will be called on $n_X$ and $n_Q$. The condition 
$c(n_Q)(t(n_X))$ holds. Thus, match_array[$n_Q$, $n_X$] will be assigned the value "true". 

• Suppose that $h(n_Q) > 0$ and $n_Q$ is an or-node. Suppose that $n_X \in \mu(n_Q)$. Once again, 
according to Lemma A.1, $\mathit{path}(n_X) = \mathit{path}(n_Q)$. Thus, Matches will be called on $n_X$ and 
$n_Q$. It also follows that one of the following must hold: 

1. The value of $c(n_Q)(t(n_X))$ is "true". Thus, Matches returns true. 

2. There is a child $m_Q$ of $n_Q$ and a child $m_X$ of $n_X$ such that $m_X$ matches $m_Q$ and 
$m_X \in \mu(m_Q)$. Note that $h(m_Q) < h(n_Q)$. Thus, by the induction hypothesis, the 
value of match_array[$m_Q$, $m_X$] after the first pass of the algorithm is "true". Thus 
Matches returns true when called on $n_Q$ and $n_X$. 

• Suppose that $h(n_Q) > 0$ and $n_Q$ is an and-node. We omit the proof as it is similar to the 
previous case. 

It is easy to see that it follows from the connectivity of $\mu$ that if $n_X \in \mu(n_Q)$ then the value of 
match_array[$n_Q$, $n_X$] will not be changed to "false". Thus, $\mu$ is contained in $\mu_R$ as required. □ 

We can now prove the theorem required. 

Theorem A.4 (Correctness of Query_Evaluate) Given a document X and a query Q, the 
algorithm Query_Evaluate retrieves the output set of X w.r.t. Q. 

Proof. Let X be a document and Q be a query. We show that a document node $n_X$ is returned 
by Query_Evaluate if and only if $n_X$ is in the output set of X w.r.t. Q. 

"$\Leftarrow$" Suppose that $n_X$ is in the output set of X w.r.t. Q. Then there is a satisfying matching 
$\mu$ of X to Q and an output node $n_Q$ in Q such that either $n_X \in \mu(n_Q)$ or $n_X$ is an 
ancestor or descendant of a node in $\mu(n_Q)$. Let $\mu_R$ be the retrieval function of Query_Evaluate 
w.r.t. X and Q. According to Lemma A.3, $\mu$ is contained in $\mu_R$. Thus, clearly $n_X$ is returned 
by Query_Evaluate. 

"$\Rightarrow$" Suppose that $n_X$ is returned by Query_Evaluate. Let $\mu_R$ be the retrieval function 
of Query_Evaluate w.r.t. X and Q. Then there is an output node $n_Q$ in Q such that either 
$n_X \in \mu_R(n_Q)$ or $n_X$ is an ancestor or descendant of a node in $\mu_R(n_Q)$. According to Lemma A.2, 
$\mu_R$ is a satisfying matching of X to Q. Thus, $n_X$ is in the output set of X w.r.t. Q. □ 



